Runbook: Flux Reconciliation Failure
Alert
- Prometheus Alert: FluxReconciliationFailure
- Grafana Dashboard: Flux CD dashboard
- Firing condition: A Flux Kustomization or HelmRelease has been in a failed or not-ready state for more than 15 minutes
Severity
Warning -- A reconciliation failure means Flux cannot apply the desired state from Git, so changes committed there are not reaching the cluster and the cluster may drift from the repository. Extended failures may indicate a broken deployment or a dependency issue.
Impact
- New deployments or configuration changes from Git are not applied
- Platform components may be running stale configurations
- If a core component fails (Istio, Kyverno, monitoring), downstream services may be affected
- Security patches committed to Git are not being rolled out
Investigation Steps
- Get the status of all Flux Kustomizations:
flux get kustomizations -A
- Get the status of all HelmReleases:
flux get helmreleases -A
- Identify the specific failing resource and check the controller logs for it:
flux logs --kind=Kustomization --name=<name> --namespace=flux-system
flux logs --kind=HelmRelease --name=<name> --namespace=<namespace>
- Check the Flux source-controller for Git repository sync issues:
flux get sources git -A
kubectl logs -n flux-system deployment/source-controller --tail=100
- Check the Flux helm-controller for Helm-specific errors:
kubectl logs -n flux-system deployment/helm-controller --tail=100
- Check the Flux kustomize-controller:
kubectl logs -n flux-system deployment/kustomize-controller --tail=100
- Verify Flux system pods are running:
kubectl get pods -n flux-system
- Check for resource conflicts or validation errors:
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
- Check if the HelmRelease has dependency issues:
kubectl get helmrelease <name> -n <namespace> -o yaml | grep -A 10 dependsOn
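The first two steps can be narrowed to just the objects that are not ready. This is a minimal sketch, assuming kubectl access to the cluster; the helper name list_not_ready is hypothetical and not part of Flux or kubectl.

```shell
# list_not_ready KIND: print namespace/name for every object of KIND whose
# Ready condition is anything other than "True" (hypothetical helper).
# Objects that have no Ready condition yet are reported as Ready=<none>.
list_not_ready() {
  kubectl get "$1" -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status' \
  | awk '$3 != "True" { print $1 "/" $2 " (Ready=" $3 ")" }'
}
```

For example, list_not_ready kustomizations.kustomize.toolkit.fluxcd.io followed by list_not_ready helmreleases.helm.toolkit.fluxcd.io gives a short list to feed into the flux logs step.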
Resolution
HelmRelease stuck in "not ready" due to failed upgrade
- Check the Helm history:
helm history <release-name> -n <namespace>
- If a bad revision exists, let Flux retry:
flux reconcile helmrelease <name> -n <namespace> --with-source
- If retries are exhausted, reset the release:
flux suspend helmrelease <name> -n <namespace>
helm rollback <release-name> <last-good-revision> -n <namespace>
flux resume helmrelease <name> -n <namespace>
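The suspend, rollback, and resume sequence is easy to leave half finished during an incident. The sketch below wraps it in one function. The function name and argument order are hypothetical, and resuming after a failed rollback is a design choice made here rather than a Flux requirement.

```shell
# rollback_helmrelease HR RELEASE NAMESPACE REVISION (hypothetical helper).
# Suspends the HelmRelease, rolls the Helm release back, then resumes Flux
# even if the rollback failed, so the object is never left suspended.
# Returns the exit status of "helm rollback".
rollback_helmrelease() {
  hr="$1"; release="$2"; ns="$3"; rev="$4"
  flux suspend helmrelease "$hr" -n "$ns" || return 1
  rc=0
  helm rollback "$release" "$rev" -n "$ns" || rc=$?
  flux resume helmrelease "$hr" -n "$ns" || return 1
  return "$rc"
}
```

The HelmRelease name and the Helm release name are passed separately because they can differ, for example when the HelmRelease sets a target namespace or an explicit release name.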
Kustomization failing due to invalid YAML
- Check the error message in the Kustomization status:
kubectl get kustomization <name> -n flux-system -o yaml | grep -A 5 'message:'
- Fix the YAML in the Git repository
- Push the fix and force reconciliation:
flux reconcile source git sre-platform -n flux-system
flux reconcile kustomization <name> -n flux-system
Git source not syncing
- Check the GitRepository status:
flux get sources git -A
kubectl describe gitrepository sre-platform -n flux-system
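- Verify that the Git credentials secret exists and holds the expected keys (the values are only base64-encoded, so do not paste this output into tickets or chat):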
kubectl get secret flux-system -n flux-system -o yaml
- Test connectivity from the cluster to the Git repository:
kubectl run -n flux-system --rm -it --restart=Never curl-test --image=curlimages/curl:8.4.0 -- curl -I https://github.com
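When checking the credentials secret, the key names alone often show whether it is set up for SSH or HTTPS. A minimal sketch follows, assuming the usual Flux key names (identity for SSH, username and password for HTTPS basic auth, bearerToken for token auth); the helper name is hypothetical.

```shell
# git_secret_auth_type SECRET NAMESPACE: report which credential style a Flux
# Git secret carries, judged by key names only (hypothetical helper).
# Secret values are never printed.
git_secret_auth_type() {
  keys="$(kubectl get secret "$1" -n "$2" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}')" || return 1
  if printf '%s\n' "$keys" | grep -qx 'identity'; then
    echo "ssh"
  elif printf '%s\n' "$keys" | grep -qx 'password'; then
    echo "https-basic"
  elif printf '%s\n' "$keys" | grep -qx 'bearerToken'; then
    echo "https-bearer"
  else
    echo "unknown"
  fi
}
```

Running git_secret_auth_type flux-system flux-system and comparing the result with the URL scheme shown by kubectl describe gitrepository can catch a mismatch such as an ssh:// URL paired with a username/password secret.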
Dependency failure cascading
If component B depends on component A, and A is failing:
- Fix component A first
- Then reconcile B:
flux reconcile helmrelease <component-a> -n <namespace-a>
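# Wait for A to become ready (the 5m timeout is an example value)
kubectl wait helmrelease/<component-a> -n <namespace-a> --for=condition=Ready --timeout=5m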
flux reconcile helmrelease <component-b> -n <namespace-b>
The dependency chain is: istio-base -> cert-manager -> kyverno -> monitoring -> logging -> openbao -> harbor -> neuvector -> keycloak -> tempo -> velero
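Given the chain above, it helps to know which components sit downstream of a failing one, since those are the ones to reconcile afterwards. The sketch below hard-codes the chain from this runbook; the variable and function names are hypothetical.

```shell
# Dependency chain copied from this runbook, root first.
FLUX_CHAIN="istio-base cert-manager kyverno monitoring logging openbao harbor neuvector keycloak tempo velero"

# downstream_of COMPONENT: print every component that comes after COMPONENT
# in the chain, in the order they should be reconciled (hypothetical helper).
# Prints nothing if COMPONENT is last in the chain or not in it at all.
downstream_of() {
  seen=0
  for c in $FLUX_CHAIN; do
    if [ "$seen" -eq 1 ]; then
      echo "$c"
    elif [ "$c" = "$1" ]; then
      seen=1
    fi
  done
}
```

For example, downstream_of keycloak prints tempo and velero, one per line.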
HelmRelease stuck with "another operation in progress"
- Check for stale Helm secrets:
kubectl get secrets -n <namespace> -l owner=helm
- If a pending install/upgrade secret exists, remove it:
kubectl delete secret sh.helm.release.v1.<name>.v<version> -n <namespace>
- Trigger a new reconciliation:
flux reconcile helmrelease <name> -n <namespace>
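Rather than scanning every Helm secret by eye, the pending ones can be selected by label. This sketch assumes the owner and status labels that Helm 3 puts on its release secrets; the helper name is hypothetical.

```shell
# pending_helm_secrets NAMESPACE: list Helm release secrets that are still in
# a pending state (hypothetical helper). Relies on the owner=helm and status
# labels that Helm 3 sets on release secrets.
pending_helm_secrets() {
  kubectl get secrets -n "$1" -o name \
    -l 'owner=helm,status in (pending-install,pending-upgrade,pending-rollback)'
}
```

Review the output before deleting anything: only a secret for a release that is no longer being actively installed or upgraded is stale.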
Prevention
- Always run task lint before pushing changes to Git
- Use flux diff kustomization to preview changes before committing
- Pin exact chart versions in HelmReleases (never use * or ranges)
- Monitor the gotk_reconcile_condition metric in Prometheus for early drift detection
- Set up Grafana alerts on Flux reconciliation duration and failure count
- Test HelmRelease changes in a dev environment before promoting to production
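The version pinning rule can be checked mechanically before a push. The sketch below is a rough heuristic, not a YAML parser: it flags any version: field that is not an exact x.y.z value, including fields that are not chart versions. The function name is hypothetical.

```shell
# find_unpinned_versions DIR: print file:line:text for every "version:" field
# in *.yaml files under DIR that is not an exact x.y.z version (hypothetical
# helper). Heuristic only: it matches any YAML key named "version".
# Exit status is 0 when at least one unpinned version was found.
find_unpinned_versions() {
  pinned="version:[[:space:]]*[\"']?v?[0-9]+\.[0-9]+\.[0-9]+([-+][0-9A-Za-z.+-]+)?[\"']?[[:space:]]*\$"
  grep -rnE --include='*.yaml' '^[[:space:]]*version:' "$1" | grep -vE "$pinned"
}
```

In a pre-push hook this could be used as: if find_unpinned_versions . ; then exit 1; fi, so that any wildcard or range blocks the push.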
Escalation
- If Flux system pods are crash-looping: escalate to platform team immediately
- If Git source is unreachable for more than 30 minutes: check network/firewall rules and Git hosting service status
- If multiple HelmReleases fail simultaneously: likely a shared dependency issue -- start from the root of the dependency chain